NON LINEAR REGRESSION MODELS

Objective

This homework sheet will help you review the basic concepts associated with non-linear models. Please review the lectures, suggested readings, and additional resources before getting started on the HW.

Some questions in this assignment will require you to conduct independent research beyond the material covered in the recorded content.

Some of the libraries / links mentioned in APPLIED questions are only for your reference. Unless specifically indicated, feel free to utilise any Python library of your choice to carry out the assignment questions.

PYGAM REFERENCE LINK:

https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html

STATSMODELS REFERENCE LINK:

https://www.statsmodels.org/dev/glm.html

PATSY REFERENCE LINK:

https://patsy.readthedocs.io/en/latest/

The following website has access to the relevant datasets from the recommended textbook: https://book.huihoo.com/introduction-to-statistical-learning/data.html

Marks Distribution

Question Marks
1a 0.5
1b 0.5
1c 0.5
1d 0.5
1e 0.5
2 1
3 1
4 1
5a 0.5
5b 0.5
5c 0.5
6a 2
6b 1
7a 1
7b 2
7c 1
7d 1

Conceptual

Question 1

Suppose that a curve $\hat{g}$ is computed to smoothly fit a set of n points using the following formula:

$$ \hat{g} = \arg\min_{g}\left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(m)}(x)]^2 \, dx \right) $$ where $g^{(m)}$ represents the $m^{th}$ derivative of $g$ (and $g^{(0)} = g$). Provide a description (or sketch) of $\hat{g}$ in each of the following scenarios.

(a) $\lambda$ = $\infty$, $m = 0$.

ANSWER With the tuning parameter $\lambda = \infty$, the penalty term dominates the optimization and fitting the points (the RSS) becomes irrelevant. For $m = 0$ the penalty is $\lambda \int [g(x)]^2 dx$, so the function itself must be driven to zero. Hence $\hat{g}(x) = 0$.

(b) $\lambda$ = $\infty$, $m = 1$.

ANSWER With $\lambda = \infty$ and $m = 1$, the penalty $\lambda \int [g^{(1)}(x)]^2 dx$ forces the first derivative of $g$ to be zero everywhere, so $g$ must be constant (perfectly flat). Among all constant functions, the RSS is minimized by the mean of the observations. Hence $\hat{g}(x) = \bar{y}$, a horizontal line at the mean of the $y_i$.

(c) $\lambda$ = $\infty$, $m = 2$.

ANSWER With $\lambda = \infty$ and $m = 2$, the penalty $\lambda \int [g^{(2)}(x)]^2 dx$ forces the second derivative to be zero, so $g$ must be a straight line (i.e. $[g^{(2)}(x)]^2 = 0$ for all $x$). $\hat{g}$ is therefore the least-squares line through the data: $\hat{g}(x) = \beta_0 + \beta_1 x$ with $\beta_0$ and $\beta_1$ chosen to minimize the RSS.

(d) $\lambda$ = $\infty$, $m = 3$.

ANSWER With $\lambda = \infty$ and $m = 3$, the penalty $\lambda \int [g^{(3)}(x)]^2 dx$ forces the third derivative to be zero (i.e. $[g^{(3)}(x)]^2 = 0$ for all $x$), so $g$ must be a quadratic. $\hat{g}$ is therefore the least-squares quadratic: $\hat{g}(x) = \beta_0 + \beta_1 x + \beta_2 x^2$ with the $\beta_i$ chosen to minimize the RSS.

(e) $\lambda$ = 0, $m = 3$.

ANSWER When $\lambda = 0$, the regularization term has no effect and the criterion reduces to minimizing the sum of squared errors (RSS) alone. $\hat{g}$ can then be any curve that passes through every data point exactly (an interpolating curve), giving zero training RSS but very likely severe overfitting.
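The five cases above can be collected in one place: as $\lambda \to \infty$ the penalty forces $g^{(m)} \equiv 0$, so $\hat{g}$ is the least-squares polynomial of degree $m - 1$.

```latex
\begin{aligned}
\lambda = \infty,\ m = 0 &: \quad \hat{g}(x) = 0 \\
\lambda = \infty,\ m = 1 &: \quad \hat{g}(x) = \bar{y} \\
\lambda = \infty,\ m = 2 &: \quad \hat{g}(x) = \beta_0 + \beta_1 x \\
\lambda = \infty,\ m = 3 &: \quad \hat{g}(x) = \beta_0 + \beta_1 x + \beta_2 x^2 \\
\lambda = 0 &: \quad \hat{g} \text{ interpolates the } (x_i, y_i) \text{ exactly}
\end{aligned}
```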

Question 2

For (1) step functions, (2) local regression, and (3) GAMs: provide one situation when using that particular model is advantageous. For instance, “GAMs is likely the preferred option when the question I’m interested in addressing involves…”.

ANSWER

  1. Step Functions: Step functions are useful when you want to represent a predictor's effect that changes abruptly at certain thresholds. By cutting the range of the predictor into bins and fitting a constant in each bin, the model can capture discontinuous relationships directly. For example, in marketing, you might use a step function to model the effect of price changes on demand, where demand remains roughly constant until a price threshold is reached, after which it drops suddenly.

  2. Local Regression: Local regression, such as LOESS (locally estimated scatterplot smoothing), is advantageous when the relationship in your data is non-linear but you want a more flexible approach than global polynomial regression. It is a non-parametric technique that fits a smooth curve by assigning higher weights to nearby data points and lower weights to more distant points. Note that it is a memory-based procedure: all the training data are needed each time we wish to compute a prediction.

  3. Generalized Additive Models (GAMs): GAMs are advantageous when the relationship between the predictors and the response is non-linear and you want a flexible model that captures these non-linearities without specifying their exact form. They are also useful when you have multiple predictors and want to understand each one's individual effect on the response. The main limitation of GAMs is that the model is restricted to be additive; however, we can manually add interaction terms by including additional predictors of the form $X_j \times X_k$. In a GAM, the relationship between each predictor and the response is modeled using a smooth function, allowing intricate data patterns to be captured.
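As a concrete illustration of (1), a step function can be fit with nothing more than binning and per-bin means. This is a minimal sketch on synthetic data; the price threshold at 5 and the bin edges are assumptions made just for the demo:

```python
import numpy as np
import pandas as pd

# Synthetic demand curve with an abrupt drop at a price threshold of 5
# (threshold and bin edges are assumptions for this demo)
rng = np.random.default_rng(1)
price = rng.uniform(0, 10, 200)
demand = np.where(price < 5, 5.0, 2.0) + rng.normal(0, 0.2, 200)

# Step-function regression: cut the predictor into bins and fit a constant
# (the within-bin mean of the response) in each region
bins = pd.cut(price, bins=[0, 2.5, 5, 7.5, 10])
step_fit = pd.Series(demand).groupby(bins, observed=True).transform('mean')

# The fitted constants recover the jump: bins left of 5 sit near 5, right of 5 near 2
print(pd.Series(demand).groupby(bins, observed=True).mean().round(2))
```

Each observation's fitted value in `step_fit` is simply the mean of its bin, which is exactly the piecewise-constant fit a step-function regression produces.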

Question 3

Define the following types of smooth functions from PYGAM library (https://pygam.readthedocs.io/en/latest/notebooks/tour_of_pygam.html ) and list at least one relevant potential use:

a) l() linear terms

b) s() spline terms

c) f() factor terms

d) te() Tensor product interaction terms

ANSWER In the context of the PYGAM library for generalized additive models (GAMs), each function type serves a specific purpose in modeling nonlinear relationships between predictors and the target variable.

a) l(feature, lam=0.6, penalties='auto', verbose=False) linear terms: These are simple linear functions of a single predictor. They model a linear relationship between the predictor and the target variable.

Example: Suppose you're modeling the relationship between temperature and energy consumption. You might use a linear term to model how energy consumption changes with temperature in a linear manner.

b) s(feature, n_splines=20, spline_order=3, lam=0.6, penalties='auto', constraints=None, dtype='numerical', basis='ps', by=None, edge_knots=None, verbose=False) spline terms: Splines are flexible functions that allow for nonlinear relationships with the predictor. They are especially useful when the relationship is not linear but still smooth.

Example: In a study of the effect of age on income, you might use a spline term to capture the nonlinear relationship, as income might increase rapidly in early career stages, level off, and potentially decline post-retirement.

c) f(feature, lam=0.6, penalties='auto', coding='one-hot', verbose=False) factor terms: Factor terms are used for categorical predictors. They allow each category to have its own effect on the target variable, which is particularly useful when the effect of the category is not linear or when there are interactions between categories.

Example: If you're studying the impact of different types of marketing strategies on sales, you might use factor terms for the different marketing channels to capture how each channel affects sales differently.

d) te() Tensor product interaction terms: Tensor product interactions allow for interactions between two or more predictors. They are useful when the effect of one predictor on the target variable depends on the value of another predictor.

Example: In a study of the effect of both temperature and humidity on plant growth, you might use a tensor product interaction term to capture how the effect of temperature on growth varies with different levels of humidity.

Each of these function types provides a way to model complex relationships in GAMs, allowing for more flexibility and accuracy in capturing the underlying patterns in the data.

Question 4

Briefly explain what spatial and temporal autocorrelation are? List one way in which GAMs could help to account for this (e.g. indicate the type of smooth function you would use or potential types of correlation structures).

ANSWER Spatial autocorrelation refers to the degree to which the values of a variable at nearby locations in space are similar to each other. Temporal autocorrelation, on the other hand, refers to the degree to which the values of a variable at nearby points in time are similar to each other.

Generalized additive models (GAMs) can help account for spatial or temporal autocorrelation by incorporating smooth functions of space or time: for spatial autocorrelation, a smooth surface over latitude and longitude (e.g. a tensor-product smooth); for temporal autocorrelation, a smooth function of time.

Alternatively, one could use structured correlation functions, such as the exponential, spherical, or Matérn covariance functions, to model the spatial or temporal correlation explicitly. These functions define the correlation between two points as a function of their distance or time difference, allowing for more flexible modeling of autocorrelation structures.

Question 5

Consider two curves $\hat{g}_1$ and $\hat{g}_2$, defined by

$$ \hat{g}_1 = \arg\min_{g}\left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(x)]^2 \, dx \right) $$

$$ \hat{g}_2 = \arg\min_{g}\left(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(x)]^2 \, dx \right) $$

where $g^{(m)}$ represents the $m^{th}$ derivative of $g$. Now, answer the following questions (and include a brief explanation):

a) As $\lambda \rightarrow \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have smaller training RSS?

ANSWER As $\lambda \rightarrow \infty$, the penalty term dominates in both problems: $\hat{g}_1$ must satisfy $g^{(3)} = 0$ (a quadratic), while $\hat{g}_2$ must satisfy $g^{(4)} = 0$ (a cubic). Every quadratic is a cubic with the leading coefficient set to zero, so $\hat{g}_2$ is chosen from a strictly larger class of functions and can fit the training data at least as well. Hence $\hat{g}_2$ will have the smaller training RSS.

b) As $\lambda \rightarrow \infty$, will $\hat{g}_1$ or $\hat{g}_2$ have smaller test RSS?

ANSWER As $\lambda \rightarrow \infty$, $\hat{g}_1$ is restricted to a quadratic while $\hat{g}_2$ can be a cubic. The extra flexibility of $\hat{g}_2$ lowers bias but raises variance, and if it mainly fits noise it will generalize worse. So $\hat{g}_1$ is likely to have the smaller test RSS, provided the true relationship is not so complex that the quadratic restriction introduces too much bias.

c) For $\lambda = 0$, will $\hat{g}_1$ or $\hat{g}_2$ have smaller training RSS?

ANSWER For $\lambda = 0$ there is no penalty term, so both criteria reduce to minimizing the RSS without any regularization. The two optimization problems become identical, and $\hat{g}_1$ and $\hat{g}_2$ will have the same training RSS (and the same test RSS).
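The nesting argument behind parts (a) and (b) can be written out symbolically: as $\lambda \to \infty$,

```latex
g^{(3)} \equiv 0 \;\Longrightarrow\; \hat{g}_1(x) = \beta_0 + \beta_1 x + \beta_2 x^2,
\qquad
g^{(4)} \equiv 0 \;\Longrightarrow\; \hat{g}_2(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3,
```

and since every quadratic is the cubic with $\beta_3 = 0$, the candidate class for $\hat{g}_2$ strictly contains that of $\hat{g}_1$.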

APPLIED

Question 6

This question relates to the College data set.

pre process the data if required

a) Fit a GAM on the training data, using out-of-state tuition as the response and only subset of features as the predictors ( Select 'private', 'Room.Board' , 'PhD' , 'perc.alumni' , 'Expend' , 'Grad.rate' ) Plot the results and briefly explain your findings

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

college = pd.read_csv('College.csv')
preprocessed_college = college.copy()
preprocessed_college['private'] = college['Private'].map({'Yes': 1, 'No': 0})
preprocessed_college
Out[1]:
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate private
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60 1
1 Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56 1
2 Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54 1
3 Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59 1
4 Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
772 Worcester State College No 2197 1515 543 4 26 3089 2029 6797 3900 500 1200 60 60 21.0 14 4469 40 0
773 Xavier University Yes 1959 1805 695 24 47 2849 1107 11520 4960 600 1250 73 75 13.3 31 9189 83 1
774 Xavier University of Louisiana Yes 2097 1915 695 34 61 2793 166 6900 4200 617 781 67 75 14.4 20 8323 49 1
775 Yale University Yes 10705 2453 1317 95 99 5217 83 19840 6510 630 2115 96 96 5.8 49 40386 99 1
776 York College of Pennsylvania Yes 2989 1855 691 28 63 2988 1726 4990 3560 500 1250 75 75 18.1 28 4509 99 1

777 rows × 20 columns

In [2]:
preprocessed_college = preprocessed_college[['private', 'Room.Board' , 'PhD' , 'perc.alumni' , 'Expend' , 'Grad.Rate', 'Outstate']]
# checking for any anomaly in data
plt.figure(figsize=(16, 8))
sns.boxplot(data=preprocessed_college)
plt.show()
[Figure: boxplots of the selected features]
In [3]:
preprocessed_college.describe()
Out[3]:
private Room.Board PhD perc.alumni Expend Grad.Rate Outstate
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.00000 777.000000
mean 0.727156 4357.526384 72.660232 22.743887 9660.171171 65.46332 10440.669241
std 0.445708 1096.696416 16.328155 12.391801 5221.768440 17.17771 4023.016484
min 0.000000 1780.000000 8.000000 0.000000 3186.000000 10.00000 2340.000000
25% 0.000000 3597.000000 62.000000 13.000000 6751.000000 53.00000 7320.000000
50% 1.000000 4200.000000 75.000000 21.000000 8377.000000 65.00000 9990.000000
75% 1.000000 5050.000000 85.000000 31.000000 10830.000000 78.00000 12925.000000
max 1.000000 8124.000000 103.000000 64.000000 56233.000000 118.00000 21700.000000
In [4]:
sns.pairplot(data=preprocessed_college)
Out[4]:
<seaborn.axisgrid.PairGrid at 0x10777ad10>
[Figure: pairplot of the selected features]
In [5]:
!pip install pygam 
In [6]:
from pygam import LinearGAM
features = ['private', 'Room.Board', 'PhD', 'perc.alumni', 'Expend', 'Grad.Rate']
X = preprocessed_college[features]
Y = preprocessed_college["Outstate"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

lams = np.logspace(-3, 3, 20) * X_train.shape[1]
splines = list(range(5, 55, 5))

best_mse = np.inf
best_lambda = None
best_spline = 5
for lam in lams:
    for spline in splines:
        gam = LinearGAM(n_splines=spline, lam=lam).fit(X_train.values, y_train.values)
        y_pred = gam.predict(X_test.values)
        mse = mean_squared_error(y_test, y_pred)
        if mse < best_mse:
            best_mse = mse
            best_lambda = lam
            best_spline = spline

print(f"Best Lambda: {best_lambda}")
print(f"Best no of Splines: {best_spline}")
print(f"Best MSE: {best_mse}")

final_gam = LinearGAM(n_splines=best_spline, lam=best_lambda).fit(X_train.values, y_train.values)

# Partial Dependence Plots
for i, term in enumerate(final_gam.terms):
    if term.isintercept:
        continue

    XX = final_gam.generate_X_grid(term=i)
    pdep, confi = final_gam.partial_dependence(term=i, X=XX, width=0.95)

    # Get the feature name from the DataFrame
    feature_name = X.columns[i]

    # Plot partial dependence and confidence intervals
    plt.figure()
    plt.rcParams['figure.figsize'] = (10, 6)
    plt.plot(XX[:, i], pdep)
    plt.fill_between(XX[:, i], confi[:, 0], confi[:, 1], color='r', alpha=0.3)
    plt.title(f'Partial Dependence Plot for {feature_name}')
    plt.xlabel(feature_name)
    plt.ylabel('Predicted out-of-state Tuition')
    plt.show()
Best Lambda: 2.0158909717702684
Best no of Splines: 15
Best MSE: 3374555.644641011
[Figures: partial dependence plots for the six predictors]

We can see how the relationship changes for each feature:

  • For private, private colleges have higher out-of-state tuition in general.
  • For Room.Board, the relationship is not strictly linear, but out-of-state tuition generally increases with room and board.
  • Out-of-state tuition is almost constant with PhD, with only small differences.
  • perc.alumni also shows a small positive association.
  • Out-of-state tuition increases with Expend at first, but the effect levels off at higher expenditures.
  • Grad.Rate varies in a wave-like manner, increasing over the first half of its range and decreasing over the second half.

b) For which variables, if any, is there evidence of a non-linear relationship with the response?

From the above plots, Grad.Rate, Expend, perc.alumni, and Room.Board show clear patterns that are not linear; the effect of PhD is comparatively flat, with only mild non-linearity.

Question 7

Answer the questions below using the following dataset (WEEK3_Q7 CSV provided with the learner template):

In [7]:
df2 = pd.read_csv('Week3_Q7.csv')
df2.head(10)
Out[7]:
Year Route statenum richness abundance Active Latitude Longitude RouteTypeID
0 1966 2 25 41 491 1 30.375543 -86.275832 1
1 1966 3 25 40 292 1 30.519316 -86.480913 1
2 1966 4 25 51 491 1 30.528840 -85.133454 1
3 1966 6 25 52 851 1 30.595019 -84.041309 1
4 1966 10 25 56 458 1 30.273463 -83.856968 1
5 1966 11 25 43 360 1 29.668699 -83.378206 1
6 1966 12 25 45 344 1 30.502624 -82.733741 1
7 1966 13 25 57 599 0 29.546846 -82.263030 1
8 1966 14 25 52 485 1 29.191620 -82.394802 1
9 1966 16 25 46 1452 0 28.036316 -82.505005 1
In [8]:
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2824 entries, 0 to 2823
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Year         2824 non-null   int64  
 1   Route        2824 non-null   int64  
 2   statenum     2824 non-null   int64  
 3   richness     2824 non-null   int64  
 4   abundance    2824 non-null   int64  
 5   Active       2824 non-null   int64  
 6   Latitude     2824 non-null   float64
 7   Longitude    2824 non-null   float64
 8   RouteTypeID  2824 non-null   int64  
dtypes: float64(2), int64(7)
memory usage: 198.7 KB
In [9]:
 df2.nunique()
Out[9]:
Year             50
Route           123
statenum          1
richness         59
abundance      1000
Active            2
Latitude        119
Longitude       122
RouteTypeID       1
dtype: int64
In [10]:
df2.describe(include='all')
Out[10]:
Year Route statenum richness abundance Active Latitude Longitude RouteTypeID
count 2824.000000 2824.000000 2824.0 2824.000000 2824.000000 2824.000000 2824.000000 2824.000000 2824.0
mean 1995.283286 171.122167 25.0 47.113314 689.820113 0.830028 28.724434 -82.661525 1.0
std 13.169235 303.107324 0.0 9.066053 354.727766 0.375674 1.659540 1.839859 0.0
min 1966.000000 1.000000 25.0 19.000000 147.000000 0.000000 24.597703 -87.407943 1.0
25% 1987.000000 19.000000 25.0 41.000000 453.750000 1.000000 27.426641 -83.856968 1.0
50% 1997.000000 52.000000 25.0 47.000000 615.000000 1.000000 28.785686 -82.118666 1.0
75% 2006.000000 87.000000 25.0 53.000000 809.000000 1.000000 30.292951 -81.418950 1.0
max 2015.000000 925.000000 25.0 79.000000 4168.000000 1.000000 30.965694 -80.203285 1.0
In [11]:
import matplotlib.pyplot as plt
correlation_matrix = df2.drop(["statenum", "RouteTypeID"], axis=1).corr()
plt.figure(figsize=(16, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Matrix')
plt.show()
[Figure: correlation heatmap of df2]
In [12]:
sns.pairplot(data=df2)
Out[12]:
<seaborn.axisgrid.PairGrid at 0x30a7aaef0>
[Figure: pairplot of df2]

a) Only by inspecting the plots shown below: Is there any temporal or spatial trend in the response variable richness?

In [13]:
sns.scatterplot(x=df2.Year,y=df2.richness)
Out[13]:
<Axes: xlabel='Year', ylabel='richness'>
[Figure: scatterplot of richness vs. Year]

ANSWER In the plot we cannot see a strong relationship for the response variable richness; the correlation with Year is only about -0.15. There is, however, a weak association: the spread of richness values appears to grow over time while the average richness slowly declines, and on closer inspection there is a wave-like pattern in richness as time moves on. This could indicate weak temporal autocorrelation. The large variation among routes within the same year also suggests there could be a spatial trend in the response variable, and it may grow with time.

b) Now, fit a Poisson GAM with the following structure E(richness) = exp(f(year))*exp(g(location)). Note that location is actually two features: Longitude and Latitude. Name this model M1. Plot the model, get the summary of M1, and answer: Is there evidence for effects of location and year?

In [14]:
X = df2[['Year', 'Longitude', 'Latitude']]
y = df2['richness']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [15]:
from pygam import PoissonGAM, s, te
import pandas as pd
import matplotlib.pyplot as plt

lams = np.logspace(-3, 3, 20) * X_train.shape[1]
splines = list(range(20, 28, 1))
term = s(0) + te(1, 2)
best_mse = np.inf
best_spline = 1
M1 = None
# for lam in lams:
for spline in splines:
    gam = PoissonGAM(term, n_splines=spline).gridsearch(X_train.values, y_train.values)
    y_pred = gam.predict(X_test.values)
    mse = mean_squared_error(y_test, y_pred)
    if mse < best_mse:
        M1 = gam
        best_mse = mse
        best_spline = spline

print(f"Best Lambda: {M1.lam}")
print(f"Best no of Splines: {best_spline}")
print(f"Best MSE: {best_mse}")
Best Lambda: [[0.015848931924611134], [[0.015848931924611134], [0.015848931924611134]]]
Best no of Splines: 24
Best MSE: 31.934568944096114
In [16]:
XX = M1.generate_X_grid(term=0)
pdep, confi = M1.partial_dependence(term=0, X=XX, width=0.95)
feature_name = X.columns[0]
plt.figure()
title = f'Predicted Richness'
plt.rcParams['figure.figsize'] = (10, 6)
plt.plot(XX[:, 0], pdep)
plt.fill_between(XX[:, 0], confi[:, 0], confi[:, 1], color='r', alpha=0.3)
plt.title(f'Partial Dependence Plot for {feature_name}')
plt.xlabel(feature_name)
plt.ylabel(title)
plt.show()

def genrate_partial_dependence_graph(M, term_index, predictor_columns=[], response_column="Richness"):
    feature_index = M.terms[term_index].feature
    XX = M.generate_X_grid(term=term_index)
    pdep, confi = M.partial_dependence(term=term_index, X=XX, width=0.95)
    # combined_data = np.concatenate((XX, pdep, confi), axis=1)
    if len(predictor_columns) == 0:
        predictor_columns = X.columns
    data = {predictor_columns[i]:  XX[:, i] for i in feature_index}
    data.update({response_column: pdep, 'confi_low': confi[:, 0], 'confi_high': confi[:, 1]})
    combined_df = pd.DataFrame(data)
    for index in feature_index:
        feature_name = predictor_columns[index]
        sorted_df = combined_df.sort_values(by=[feature_name])
        plt.figure()
        plt.rcParams['figure.figsize'] = (10, 6)
        plt.plot(sorted_df[feature_name], sorted_df[response_column])
        plt.fill_between(sorted_df[feature_name], sorted_df["confi_low"], sorted_df["confi_high"], color='r', alpha=0.3)
        plt.title(f'Partial Dependence Plot for {feature_name}')
        plt.xlabel(feature_name)
        plt.ylabel(f'Predicted {response_column}')
        plt.show()
genrate_partial_dependence_graph(M1, 1, X.columns, "Richness")
[Partial dependence plots for the terms of M1]
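As a side note on what these curves represent: for an additive model, the partial dependence of one feature is just that term's contribution, up to an additive constant, which is why pyGAM can plot each term in isolation. A toy numpy sketch (synthetic function, not tied to the data above) makes this concrete:

```python
import numpy as np

# Toy additive model: f(x0, x1) = sin(x0) + 0.5 * x1.
rng = np.random.default_rng(0)
x1_sample = rng.uniform(-1, 1, size=1000)

grid = np.linspace(0, np.pi, 50)
# Partial dependence of x0: average f over the observed x1 values.
pd_x0 = np.array([np.mean(np.sin(g) + 0.5 * x1_sample) for g in grid])

# Subtracting the constant offset recovers the sin(x0) term exactly.
offset = 0.5 * x1_sample.mean()
assert np.allclose(pd_x0 - offset, np.sin(grid))
```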
In [17]:
def generate_interaction_plot(M, i, predictor_columns=[], response_column="Richness"):
    XX = M.generate_X_grid(term=i, meshgrid=True)
    feature_index = M.terms[i].feature
    pdep, confi = M.partial_dependence(term=i, X=XX, width=0.95, meshgrid=True)
    if len(predictor_columns) == 0:
        predictor_columns = X.columns
    # print(pdep.shape, XX[0].shape)
    # Create a 3D plot
    fig = plt.figure(figsize=(16, 10))
    ax = fig.add_subplot(111, projection='3d')
    
    # Plot the surface
    surf = ax.plot_surface(XX[0].reshape(pdep.shape),
                            XX[1].reshape(pdep.shape),
                            pdep, cmap='viridis', edgecolor='none')
    
    # Add confidence intervals
    ax.plot_wireframe(XX[0].reshape(pdep.shape),
                       XX[1].reshape(pdep.shape),
                       confi[:, :, 0], color='r', alpha=0.3)
    ax.plot_wireframe(XX[0].reshape(pdep.shape),
                       XX[1].reshape(pdep.shape),
                       confi[:, :, 1], color='r', alpha=0.3)
    
    # Set labels and title
    ax.set_xlabel(predictor_columns[feature_index[0]])
    ax.set_ylabel(predictor_columns[feature_index[1]])
    ax.set_zlabel( f'Predicted {response_column}')
    ax.set_title(f'Partial Dependence Plot for {predictor_columns[feature_index[0]]} and {predictor_columns[feature_index[1]]}')
    ax.set_box_aspect(None, zoom=0.9)
    
    # Show the plot
    plt.show()
generate_interaction_plot(M1, 1, X.columns)
[3D partial dependence surface for the tensor interaction term of M1]
In [18]:
M1.summary()
PoissonGAM                                                                                                
=============================================== ==========================================================
Distribution:                       PoissonDist Effective DoF:                                    126.7811
Link Function:                          LogLink Log Likelihood:                                 -7142.2696
Number of Samples:                         2259 AIC:                                            14538.1014
                                                AICc:                                           14553.5512
                                                UBRE:                                                2.808
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.6373
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [0.0158]             24           24.0         2.24e-10     ***         
te(1, 2)                          [0.0158 0.0158]      576          102.8        0.00e+00     ***         
intercept                                              1            0.0          1.59e-09     ***         
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.
/var/folders/fq/t1vhfgh95v33mglpnsqgnyyh0000gq/T/ipykernel_9609/3019879061.py:1: UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. 
 
Please do not make inferences based on these values! 

Collaborate on a solution, and stay up to date at: 
github.com/dswah/pyGAM/issues/163 

  M1.summary()

Yes. The partial dependence plots show that richness varies in a wave-like pattern over time, and the tensor term shows how location (Latitude and Longitude) affects richness. Note also that the confidence intervals narrow in regions with more data. The p-values for the feature functions are very low, so the evidence for these effects is clear, although, as the summary itself warns, p-values are likely understated for penalized models and should not be over-interpreted.
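The point that confidence intervals narrow where there is more data can be illustrated with a toy calculation (generic statistics, not tied to pyGAM): the standard error of an estimate shrinks like $1/\sqrt{n}$, so a 95% interval built from 1000 points is about 10x tighter than one built from 10.

```python
import numpy as np

# Width of a 95% CI for a mean with standard deviation sd: 2 * 1.96 * sd / sqrt(n).
sd = 5.0
width_small = 2 * 1.96 * sd / np.sqrt(10)    # n = 10
width_large = 2 * 1.96 * sd / np.sqrt(1000)  # n = 1000

# The ratio is sqrt(1000/10) = 10: 100x the data gives 10x the precision.
assert abs(width_small / width_large - 10.0) < 1e-9
```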

c) Now, fit another Poisson GAM model with the same structure as in M1 , name this model M2 but using interactions between Year, Latitude and Longitude

In [19]:
from pygam import PoissonGAM, te
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Search over the number of splines for a single three-way tensor term;
# gridsearch() tunes lambda internally for each candidate.
splines = list(range(6, 12, 1))
term = te(0, 1, 2)
best_mse = np.inf
best_spline = 1
M2 = None
for spline in splines:
    gam = PoissonGAM(term, n_splines=spline).gridsearch(X_train.values, y_train.values)
    y_pred = gam.predict(X_test.values)
    mse = mean_squared_error(y_test, y_pred)
    if mse < best_mse:
        M2 = gam
        best_mse = mse
        best_spline = spline
print(f"Best Lambda: {M2.lam}")
print(f"Best no of Splines: {best_spline}")
print(f"Best MSE: {best_mse}")
100% (11 of 11) |########################| Elapsed Time: 0:00:03
100% (11 of 11) |########################| Elapsed Time: 0:00:06
100% (11 of 11) |########################| Elapsed Time: 0:00:10
100% (11 of 11) |########################| Elapsed Time: 0:00:23
100% (11 of 11) |########################| Elapsed Time: 0:00:39
100% (11 of 11) |########################| Elapsed Time: 0:01:10
Best Lambda: [[[0.001], [0.001], [0.001]]]
Best no of Splines: 9
Best MSE: 34.875029481790506
In [20]:
def generate_interaction_plot_4d(M, i, predictor_columns=[], response_column="Richness"):
    XX = M.generate_X_grid(term=i, meshgrid=True)
    feature_index = M.terms[i].feature
    pdep, confi = M.partial_dependence(term=i, X=XX, width=0.95, meshgrid=True)
    if len(predictor_columns) == 0:
        predictor_columns = X.columns
    
    # Create a 4D plot
    fig = plt.figure(figsize=(16, 10))
    ax = fig.add_subplot(111, projection='3d')
    
    # Scatter the partial-dependence values, colouring each point by the
    # first feature so all three dimensions of the interaction are visible
    scat = ax.scatter(XX[1].reshape(pdep.shape),
                      XX[2].reshape(pdep.shape),
                      pdep, c=XX[0].reshape(pdep.shape), cmap='viridis', edgecolor='none')

    # Set labels and title
    ax.set_xlabel(predictor_columns[feature_index[1]])
    ax.set_ylabel(predictor_columns[feature_index[2]])
    ax.set_zlabel(f'Predicted {response_column}')
    ax.set_title(f'Partial Dependence Plot for {predictor_columns[feature_index[0]]}, '
                 f'{predictor_columns[feature_index[1]]} and {predictor_columns[feature_index[2]]}')
    ax.set_box_aspect(None, zoom=0.9)

    # Add a colour bar for the first feature
    cbar = plt.colorbar(scat)
    cbar.set_label(predictor_columns[feature_index[0]])
    
    # Show the plot
    plt.show()
generate_interaction_plot_4d(M2, 0, X.columns)
genrate_partial_dependence_graph(M2, 0, X.columns, "Richness")
[Partial dependence plots for the tensor term of M2]
In [21]:
M2.summary()
PoissonGAM                                                                                                
=============================================== ==========================================================
Distribution:                       PoissonDist Effective DoF:                                    275.2661
Link Function:                          LogLink Log Likelihood:                                 -7087.5554
Number of Samples:                         2259 AIC:                                            14725.6429
                                                AICc:                                           14802.9481
                                                UBRE:                                               2.9436
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.6643
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
te(0, 1, 2)                       [0.001 0.001 0.001]  729          275.3        0.00e+00     ***         
intercept                                              1            0.0          8.06e-02     .           
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.
/var/folders/fq/t1vhfgh95v33mglpnsqgnyyh0000gq/T/ipykernel_9609/4283885535.py:1: UserWarning: KNOWN BUG: p-values computed in this summary are likely much smaller than they should be. 
 
Please do not make inferences based on these values! 

Collaborate on a solution, and stay up to date at: 
github.com/dswah/pyGAM/issues/163 

  M2.summary()

d) Compare M1 and M2. Note that you can use the $R^2$, AIC values, test error, among other approaches discussed in class. Which model should we select?

In [22]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

y_pred_m1 = M1.predict(X_test)
y_pred_m2 = M2.predict(X_test)

r2_m1 = r2_score(y_test, y_pred_m1)
r2_m2 = r2_score(y_test, y_pred_m2)

print("R-squared for M1:", r2_m1)
print("R-squared for M2:", r2_m2)

print("AIC for M1:", M1.statistics_['AIC'])
print("AIC for M2:", M2.statistics_['AIC'])

mse_m1 = mean_squared_error(y_test, y_pred_m1)
mse_m2 = mean_squared_error(y_test, y_pred_m2)

print("MSE for M1:", mse_m1)
print("MSE for M2:", mse_m2)

print("UBRE for M1:", M1.statistics_['UBRE'])
print("UBRE for M2:", M2.statistics_['UBRE'])

print("Log Likelihood for M1:", M1.statistics_['loglikelihood'])
print("Log Likelihood for M2:", M2.statistics_['loglikelihood'])

print("deviance for M1:", M1.statistics_['deviance'])
print("deviance for M2:", M2.statistics_['deviance'])
R-squared for M1: 0.6033590739516503
R-squared for M2: 0.5668373036806489
AIC for M1: 14538.101402113918
AIC for M2: 14725.642880036276
MSE for M1: 31.934568944096114
MSE for M2: 34.875029481790506
UBRE for M1: 2.807954648511525
UBRE for M2: 2.943558678615654
Log Likelihood for M1: -7142.269649208129
Log Likelihood for M2: -7087.5553555657225
deviance for M1: 1470.1826058108095
deviance for M2: 1360.754018525999

As per the scores:

  • Test R-squared: M1 is higher (0.603 vs 0.567), so M1 predicts the held-out data better.
  • AIC: M1 is lower (14538 vs 14726), so M1 is preferred once model complexity is penalized.
  • Test MSE: M1 is lower (31.93 vs 34.88), again favouring M1.
  • UBRE: M1 is lower (2.81 vs 2.94), favouring M1.
  • Log-likelihood: M2 is higher (-7087.6 vs -7142.3), i.e. M2 fits the training data better.
  • Deviance: M2 is lower (1360.8 vs 1470.2), again reflecting a better in-sample fit.

M2's better in-sample fit (log-likelihood, deviance) comes from its much larger effective degrees of freedom (275.3 vs 126.8), and it does not generalise: M1 wins on the test set (R-squared, MSE) and on the complexity-penalized criteria (AIC, UBRE). Therefore we should select M1.
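The AIC values in the two summaries can be reproduced by hand from the reported log-likelihoods and effective degrees of freedom, using AIC = -2 log L + 2k with k the effective DoF. This makes explicit why M2's higher log-likelihood does not save it: its complexity penalty is more than twice M1's.

```python
# Values copied from the M1 and M2 summaries above.
loglik_m1, edof_m1 = -7142.2696, 126.7811
loglik_m2, edof_m2 = -7087.5554, 275.2661

# AIC = -2 * log-likelihood + 2 * effective degrees of freedom
aic_m1 = -2 * loglik_m1 + 2 * edof_m1  # ~14538.10, matching M1.statistics_['AIC']
aic_m2 = -2 * loglik_m2 + 2 * edof_m2  # ~14725.64, matching M2.statistics_['AIC']

# M2 gains ~109 in fit (-2 * delta log-likelihood) but pays ~297 in penalty.
assert aic_m1 < aic_m2
```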
